In this last lecture we’ll introduce a couple of new concepts: cross validation and feature selection.
library(tidyverse)
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'hms'
## Warning: package 'tibble' was built under R version 4.1.2
library(tidymodels)
## Warning: package 'recipes' was built under R version 4.1.2
library(knitr)
library(plotly)
library(modelr)
We’re going to work with a subset of the data that only includes a few variables, which are selected below.
mv<-readRDS("mv.Rds")%>%
filter(!is.na(budget))%>%
mutate(log_gross=log(gross))%>%
mutate(year=as_factor(year))%>%
select(
title,
log_gross,
budget,
rating,
genre,
runtime,
year)%>%
drop_na()
As usual, we’ll split the data 75/25.
split_data<-initial_split(mv)
mv_train<-training(split_data)
mv_test<-testing(split_data)
The essence of prediction is discovering the extent to which our models can predict outcomes for data that does not come from our sample. Many times this process is temporal. We fit a model to data from one time period, then take predictors from a subsequent time period to come up with a prediction in the future. For instance, we might use data on team performance to predict the likely winners and losers for upcoming soccer games.
This process does not have to be temporal. We can also have data that is out of sample because it hadn’t yet been collected when our first data was collected, or we can also have data that is out of sample because we designated it as out of sample.
The data that is used to generate our predictions is known as training data. The idea is that this is the data used to train our model, to let it know what the relationship is between our predictors and our outcome. So far, we have worked mostly with training data.
That data that is used to validate our predictions is known as testing data. With testing data, we take our trained model and see how good it is at predicting outcomes using out of sample data.
One very simple approach to this would be to cut our data in half. This is what we’ve done so far. We could then train our model on half the data, then test it on the other half. This would tell us whether our measure of model fit (e.g. rmse, auc) is similar or different when we apply our model to out of sample data.
But this would only be a “one-shot” approach. It would be better to do this multiple times, cutting the data into two parts: training and testing, then fitting the model to the training data, and then checking its predictions against the testing data. That way, we could generate a large number of rmse’s to see how well the model fits on lots of different possible out-of-sample predictions.
This process is called cross validation, and it involves two important decisions: first, how will the data be cut, and how many times will the validation run.
We’re going to cut our training dataset 75/25, and we’ll repeat that 25 times. This is so our code will run faster– we would really want to do this more like 1,000 times in practice.
mv_rs<-mc_cv(mv_train,times=25) ## More like 1000 in practice
One of the key decisions for an analyst is which variables to include. We can make decisions about this using theory, or our understanding of the context, but we can also rely on computational approaches. This is known as regularization and it involves downweighting the importance of coefficients from a model based on the contribution that a predictor makes. We’re going to make use of a regularization penalty known as the “lasso.” The lasso downweights variables mostly be dropping variables that are highly correlated with one another, leaving only one of the correlated variables as contributors to the model. We set the degree to which this penalty will be implemented by setting the “penalty” variable in the model specification.
penalty_spec<-.1
mixture_spec<-1
lasso_fit<-
linear_reg(penalty=penalty_spec,
mixture=mixture_spec) %>%
set_engine("glmnet")%>%
set_mode("regression")
movie_wf<-workflow()
movie_wf<-movie_wf%>%
add_model(lasso_fit)
In setting the recipe for this model, we’re now going to include ever variable in the dataset. This is very common in these kinds of applications.
movie_formula<-as.formula("log_gross~.")
Because we have so many predictors, we need to generalize our process for feature engineering. Instead of running steps on particular variables, we’re going to use the capabilities of tidymodels to select types of variables.
movie_rec<-recipe(movie_formula,mv)%>%
update_role(title,new_role="id variable")%>%
update_role(log_gross,new_role="outcome")%>% ## specify dv
step_log(budget)%>%
step_other(all_nominal_predictors(),threshold = .01)%>%
step_dummy(all_nominal_predictors())%>%
step_normalize(all_predictors())%>% ## Convert all to Z scores
step_naomit(all_predictors()) ## drop missing
To see what this does, we can use the prep and
bake commands
mv_processed<-movie_rec%>%prep()%>%bake(mv_train)
Now we can add our model to the recipe.
movie_wf<-movie_wf%>%
add_recipe(movie_rec)
To fit a model to resampled data, we use fit_resamples.
It’s now going to fit the model to the training data and then test the
model against the testing data for all 25 of our resampled datasets.
movie_lasso_fit<-movie_wf%>%
fit_resamples(mv_rs)
## Warning: package 'rlang' was built under R version 4.1.2
## Warning: package 'vctrs' was built under R version 4.1.2
## Warning: package 'glmnet' was built under R version 4.1.2
Once we run this model, we get a measure of model fit from every
testing dataset. collect_metrics will let us see the
average.
movie_lasso_fit%>%
collect_metrics()
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 1.27 25 0.0111 Preprocessor1_Model1
## 2 rsq standard 0.508 25 0.00545 Preprocessor1_Model1
movie_lasso_fit%>%
unnest(.metrics)%>%
filter(.metric=="rmse")%>%
ggplot(aes(x=.estimate))+
geom_density()
movie_lasso_final <- finalize_workflow(movie_wf,
select_best(movie_lasso_fit,metric="rmse")) %>%
fit(mv_test)
movie_lasso_final%>%
extract_fit_parsnip()%>%
tidy()%>%
kable()
| term | estimate | penalty |
|---|---|---|
| (Intercept) | 17.9099703 | 0.1 |
| budget | 1.0744746 | 0.1 |
| runtime | 0.0183170 | 0.1 |
| rating_PG | 0.0000000 | 0.1 |
| rating_PG.13 | 0.0000000 | 0.1 |
| rating_R | -0.1274638 | 0.1 |
| rating_other | 0.0000000 | 0.1 |
| genre_Adventure | 0.0000000 | 0.1 |
| genre_Animation | 0.0153231 | 0.1 |
| genre_Biography | 0.0000000 | 0.1 |
| genre_Comedy | 0.0000000 | 0.1 |
| genre_Crime | 0.0000000 | 0.1 |
| genre_Drama | 0.0000000 | 0.1 |
| genre_Horror | 0.0000000 | 0.1 |
| genre_other | -0.0021133 | 0.1 |
| year_X2001 | 0.0000000 | 0.1 |
| year_X2002 | 0.0000000 | 0.1 |
| year_X2003 | 0.0000000 | 0.1 |
| year_X2004 | 0.0000000 | 0.1 |
| year_X2005 | 0.0000000 | 0.1 |
| year_X2006 | 0.0000000 | 0.1 |
| year_X2007 | 0.0000000 | 0.1 |
| year_X2008 | 0.0000000 | 0.1 |
| year_X2009 | 0.0000000 | 0.1 |
| year_X2010 | 0.0000000 | 0.1 |
| year_X2011 | 0.0000000 | 0.1 |
| year_X2012 | 0.0000000 | 0.1 |
| year_X2013 | 0.0000000 | 0.1 |
| year_X2014 | 0.0000000 | 0.1 |
| year_X2015 | 0.0000000 | 0.1 |
| year_X2016 | 0.0000000 | 0.1 |
| year_X2017 | 0.0000000 | 0.1 |
| year_X2018 | 0.0000000 | 0.1 |
| year_X2019 | 0.0000000 | 0.1 |
| year_other | 0.0000000 | 0.1 |
movie_lasso_final%>%last_fit(split_data)%>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 1.22 Preprocessor1_Model1
## 2 rsq standard 0.519 Preprocessor1_Model1
The problem with the above is that I arbitrarily set the value of penalty to be .1. Do I know this was correct? No! What we need to do is try out a bunch of different values of the penalty, and see which one gives us the best model fit. This process has the impressive name of “hyperparameter tuning” but it could just as easily be called “trying a bunch of stuff to see what works.”
Below I’m going to give the argument tune() for the
value of penalty. This will allow us to “fill in” values later.
lasso_tune_fit<-
linear_reg(penalty=tune(),mixture=mixture_spec)%>%
set_engine("glmnet")
movie_wf<-movie_wf%>%
update_model(lasso_tune_fit)
A tuning grid is a set of numbers we might want to use. I use the
function grid_regular and ask it to give me 10 possible
values of the penalty.
lasso_grid<-grid_regular(parameters(lasso_tune_fit) ,levels=10)
movie_lasso_tune_fit <-
movie_wf %>%
tune_grid(mv_rs,grid=lasso_grid)
Lets’ take a look and see which models fit better.
movie_lasso_tune_fit%>%
collect_metrics()%>%
filter(.metric=="rmse")%>%
arrange(mean)
## # A tibble: 10 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00599 rmse standard 1.23 25 0.0104 Preprocessor1_Model08
## 2 0.000464 rmse standard 1.23 25 0.0103 Preprocessor1_Model07
## 3 0.0000000001 rmse standard 1.23 25 0.0103 Preprocessor1_Model01
## 4 0.00000000129 rmse standard 1.23 25 0.0103 Preprocessor1_Model02
## 5 0.0000000167 rmse standard 1.23 25 0.0103 Preprocessor1_Model03
## 6 0.000000215 rmse standard 1.23 25 0.0103 Preprocessor1_Model04
## 7 0.00000278 rmse standard 1.23 25 0.0103 Preprocessor1_Model05
## 8 0.0000359 rmse standard 1.23 25 0.0103 Preprocessor1_Model06
## 9 0.0774 rmse standard 1.25 25 0.0109 Preprocessor1_Model09
## 10 1 rmse standard 1.65 25 0.0166 Preprocessor1_Model10
It looks like using a very small penalty (like basically none) is the way to go.
Let’s confirm what we learned by plotting the results.
movie_lasso_tune_fit%>%
unnest(.metrics)%>%
filter(.metric=="rmse")%>%
mutate(tune_id=paste0("penalty=",prettyNum(penalty,digits=4))) %>%
select(tune_id,.estimate)%>%
rename(RMSE=.estimate)%>%
ggplot(aes(x=RMSE,color=tune_id,fill=tune_id))+
geom_density(alpha=.1)
What this is telling us is that big penalty values are a really bad idea, and that most of the lower penalty values fit the data just fine.
We can choose the best model and then fit it to our training dataset.
movie_final<-
finalize_workflow(movie_wf,
select_best(movie_lasso_tune_fit,
metric="rmse"))%>%
fit(mv_train)
movie_final%>%
extract_fit_parsnip()%>%
tidy()%>%
mutate(penalty=prettyNum(penalty,digits=4))%>%
kable()
| term | estimate | penalty |
|---|---|---|
| (Intercept) | 17.9333812 | 0.005995 |
| budget | 0.9966007 | 0.005995 |
| runtime | 0.2036747 | 0.005995 |
| rating_Not.Rated | -0.1954680 | 0.005995 |
| rating_PG | 0.0049724 | 0.005995 |
| rating_PG.13 | -0.0568851 | 0.005995 |
| rating_R | -0.2914045 | 0.005995 |
| rating_other | -0.1112207 | 0.005995 |
| genre_Adventure | 0.0000000 | 0.005995 |
| genre_Animation | 0.1151601 | 0.005995 |
| genre_Biography | -0.1214634 | 0.005995 |
| genre_Comedy | -0.0119482 | 0.005995 |
| genre_Crime | -0.0566953 | 0.005995 |
| genre_Drama | -0.1111199 | 0.005995 |
| genre_Horror | 0.2131577 | 0.005995 |
| genre_other | 0.0041552 | 0.005995 |
| year_X2001 | 0.0000000 | 0.005995 |
| year_X2002 | -0.0023208 | 0.005995 |
| year_X2003 | 0.0000000 | 0.005995 |
| year_X2004 | 0.0213357 | 0.005995 |
| year_X2005 | -0.0000697 | 0.005995 |
| year_X2006 | -0.0170610 | 0.005995 |
| year_X2007 | 0.0316901 | 0.005995 |
| year_X2008 | 0.0264023 | 0.005995 |
| year_X2009 | -0.0016074 | 0.005995 |
| year_X2010 | 0.0066179 | 0.005995 |
| year_X2011 | 0.0468785 | 0.005995 |
| year_X2012 | -0.0043243 | 0.005995 |
| year_X2013 | 0.0658667 | 0.005995 |
| year_X2014 | 0.0402270 | 0.005995 |
| year_X2015 | 0.0143962 | 0.005995 |
| year_X2016 | 0.0181336 | 0.005995 |
| year_X2017 | 0.0983690 | 0.005995 |
| year_X2018 | 0.0721345 | 0.005995 |
| year_X2019 | 0.0425259 | 0.005995 |
| year_other | 0.0358267 | 0.005995 |
This is almost identical to what we would have gotten had we just used OLS, but we know now for certain!
pred_df<-movie_final%>%
predict(mv_test)%>%
rename(`Predicted Gross`=.pred)%>%
bind_cols(mv_test)%>%
rename(`Actual Gross`=log_gross)
rmse_final<-yardstick::rmse(pred_df,truth = `Actual Gross`,estimate = `Predicted Gross`)
rmse_final
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.20
gg<-pred_df%>%
ggplot(aes(y=`Actual Gross`,x=`Predicted Gross`,text=paste(title,"<br>",
`Actual Gross`,
`Predicted Gross`)))+
geom_point(alpha=.25,size=.5)+
scale_x_continuous(trans="log",labels=label_dollar())+
scale_y_continuous(trans="log",labels=label_dollar())
ggplotly(gg,tooltip="text")
Or Just
movie_final%>%last_fit(split_data)%>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 1.20 Preprocessor1_Model1
## 2 rsq standard 0.529 Preprocessor1_Model1
Once we’re sure that we at our very last model, we can get estimates from the full dataset.
movie_final<-
finalize_workflow(movie_wf,
select_best(movie_lasso_tune_fit,
metric="rmse"))%>%
fit(mv) ## FULL dataset
movie_final%>%
pull_workflow_fit()%>%
tidy()%>%
mutate(penalty=prettyNum(penalty))%>%
kable()
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.
| term | estimate | penalty |
|---|---|---|
| (Intercept) | 17.9275230 | 0.005994843 |
| budget | 1.0125321 | 0.005994843 |
| runtime | 0.2042725 | 0.005994843 |
| rating_Not.Rated | -0.1965304 | 0.005994843 |
| rating_PG | 0.0174066 | 0.005994843 |
| rating_PG.13 | -0.0295522 | 0.005994843 |
| rating_R | -0.2588306 | 0.005994843 |
| rating_other | -0.0781389 | 0.005994843 |
| genre_Adventure | 0.0000000 | 0.005994843 |
| genre_Animation | 0.1150546 | 0.005994843 |
| genre_Biography | -0.1007891 | 0.005994843 |
| genre_Comedy | -0.0051435 | 0.005994843 |
| genre_Crime | -0.0556307 | 0.005994843 |
| genre_Drama | -0.0972500 | 0.005994843 |
| genre_Horror | 0.1880654 | 0.005994843 |
| genre_other | -0.0035946 | 0.005994843 |
| year_X2001 | -0.0076118 | 0.005994843 |
| year_X2002 | 0.0000000 | 0.005994843 |
| year_X2003 | 0.0128948 | 0.005994843 |
| year_X2004 | 0.0298426 | 0.005994843 |
| year_X2005 | 0.0003864 | 0.005994843 |
| year_X2006 | 0.0000000 | 0.005994843 |
| year_X2007 | 0.0321195 | 0.005994843 |
| year_X2008 | 0.0175200 | 0.005994843 |
| year_X2009 | 0.0000000 | 0.005994843 |
| year_X2010 | 0.0000000 | 0.005994843 |
| year_X2011 | 0.0537007 | 0.005994843 |
| year_X2012 | 0.0185697 | 0.005994843 |
| year_X2013 | 0.0571450 | 0.005994843 |
| year_X2014 | 0.0598982 | 0.005994843 |
| year_X2015 | 0.0342997 | 0.005994843 |
| year_X2016 | 0.0250014 | 0.005994843 |
| year_X2017 | 0.0886255 | 0.005994843 |
| year_X2018 | 0.0672555 | 0.005994843 |
| year_X2019 | 0.0569767 | 0.005994843 |
| year_other | 0.0380641 | 0.005994843 |
This is what we would use for any incoming data.Let’s say the proposal is to make either a horror or adventure movie for 10 million. We’ve also been pitched movies that will be rated R– which of course better reflects the director’s “artistic vision” and movies that will be rated PG-13. Let’s see what our model says about these.
newdata<-data_grid(mv,
title="Data Science 3: The Tuning",
budget=1e7,
rating=c("R","PG-13"),
genre=c("Horror","Adventure"),
runtime=100,
year=as_factor(2020))
movie_final%>%
predict(newdata)%>%
bind_cols(newdata)%>%
mutate(low_dollar_amount=dollar(exp(.pred-rmse_final$.estimate)))%>%
mutate(mean_dollar_amount=dollar(exp(.pred)))%>%
mutate(hi_dollar_amount=dollar(exp(.pred+rmse_final$.estimate)))
## # A tibble: 4 × 10
## .pred title budget rating genre runtime year low_dollar_amou…
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <fct> <chr>
## 1 17.8 Data Science 3: The … 1e7 PG-13 Adve… 100 2020 $15,537,594
## 2 18.6 Data Science 3: The … 1e7 PG-13 Horr… 100 2020 $38,009,609
## 3 17.3 Data Science 3: The … 1e7 R Adve… 100 2020 $9,796,952
## 4 18.2 Data Science 3: The … 1e7 R Horr… 100 2020 $23,966,279
## # … with 2 more variables: mean_dollar_amount <chr>, hi_dollar_amount <chr>
What should we recommend as a result?